Configuration management in distributed data systems

ABSTRACT

Systems and methods for managing configurations of data nodes in a distributed environment. A configuration manager is implemented as a set of distributed master nodes that may use quorum-based processing to enable reliable identification of master nodes storing current configuration information, even if some of the master nodes fail. If a quorum of master nodes cannot be achieved or some other event occurs that precludes identification of current configuration information, the configuration manager may be rebuilt by analyzing reports from read/write quorums of nodes associated with a configuration, allowing automatic recovery of data partitions.

RELATED APPLICATIONS

This application claims benefit under 35 U.S.C. 119(e) of U.S. Provisional Application Ser. No. 61/108,076, filed on Oct. 24, 2008, the entire content of which is incorporated herein by reference.

BACKGROUND

Advances in computer technology (e.g., microprocessor speed, memory capacity, data transfer bandwidth, software functionality, and the like) have generally contributed to increased computer application in various industries. Ever more powerful server systems, which are often configured as an array of servers, are commonly provided to service requests originating from external sources such as the World Wide Web, for example.

As the amount of available electronic data grows, it becomes more important to store such data in a manageable manner that facilitates user friendly and quick data searches and retrieval. Today, a common approach is to store electronic data in one or more databases. A typical database can be referred to as an organized collection of information with data structured such that a computer program can quickly search and select desired pieces of data, for example. Moreover, in such environments a federation refers to a group of organizations or service providers that have built trust among each other and enable sharing of user identity information amongst themselves.

With the advent of distributed computing models such as web services, there are increased interdependencies among entities such as Service Providers (SPs). Accordingly, a current trend is to focus on inter-organization and interdependent management of identity information rather than identity management solutions for internal use. Such can be referred to as federated identity management. In general, federated identity is a distributed computing construct that recognizes that individuals move between corporate boundaries at an increasingly frequent rate. Practical applications of federated identities are represented by large multinational companies that are required to manage several heterogeneous systems at the same time.

In such distributed systems, various challenges exist for proper management and configuration/reconfiguration of nodes. For example, individual nodes can fail randomly, which can cause data loss when suitable contingencies are not put into place. Likewise, replicated data is often required to be moved around the system, which can further create reliability issues and consistency problems.

Moreover, reliability issues can be further complicated when data related to the overall management of such nodes is subject to loss due to failure of a centralized cache, for example.

SUMMARY

Data in a transactional data store may be replicated across many computers or other devices acting as nodes in a distributed system, such as for redundancy or high availability purposes. However, while the distributed system may provide a high guarantee of availability, the underlying computers on which the transactional data store is managed and replicated may themselves be unreliable.

The distributed system may be managed by a configuration manager that stores configuration information to enable identification of a data node or data nodes that store a current replica of the data store, or some partition of it. The configuration manager may be implemented as a set of master nodes that each maintain a copy of the configuration information. One of the master nodes in the set of master nodes may be designated as the primary master node for the configuration manager; the primary master node responds to requests for configuration information and controls reconfiguration of the data nodes.

Quorum-based processing may be used to identify the primary master node as well as to determine whether a master node containing configuration information contains the current configuration information. Even if some master nodes that make up the configuration manager fail, if sufficient master nodes to identify a master node containing the current configuration information are available, reliable configuration information can be provided. In some embodiments, a sufficient number of master nodes is determined based on information stored in the master nodes themselves.

In some embodiments, each master node stores, in conjunction with configuration information, information identifying the set of nodes that makes up the configuration manager at the time that configuration information was stored. Because the configuration information is not committed in any master nodes unless a quorum of the set of nodes intended to be a new configuration can commit, if a quorum of the nodes in such a set agree that they contain the current configuration, the identified set can reliably be taken as the current configuration. When a set of master nodes identifying the same group of master nodes as the current configuration manager represents a quorum of that group, the set can reliably be determined as the current set of nodes making up the configuration manager. Even if some of the master nodes making up the configuration manager fail, so long as a quorum of the master nodes stores consistent information identifying the current set of configuration information about the configuration manager, a reconstruction component can reliably identify a master node from which to obtain a replica of the current configuration information. The reconstruction component can also identify the master node designated as the primary master node in the current set and determine whether that primary master node is available. If the primary master node has failed, a new primary master node can be designated and possibly additional master nodes can be designated as part of the set of master nodes storing current configuration information.

In scenarios in which a quorum of master nodes cannot be identified or there is some other catastrophic failure, the reconstruction component may reconstruct the configuration manager from information stored in the data nodes.

To reconstruct the configuration manager, a new primary master node may be selected by a process that identifies a node as the primary master node in a way that all master nodes recognize the same master node as the primary master node. In some embodiments, this process may involve communication among the master nodes, which may be managed by components of the database system that facilitate communication among the nodes.

In some embodiments, the communication among the master nodes may result in configuring the master nodes into a token ring in which a token is passed from node to node, assigning ordered positions to the master nodes. The new primary master node is selected as the master node with position 0. The token ring may also be used during system operation to identify failures: a failure in any master node will be identified by nodes in the token ring adjacent to the failed node when the adjacent nodes cannot exchange a token with the failed node.

Once a primary master node is established, configuration information may be reconstructed from information stored in the data nodes. The data nodes in the distributed system may provide messages to one or more of the master nodes (e.g., the primary master node) indicating the data nodes, including a primary data node, storing a replica of the current configuration.

The messages from the data nodes are compared to identify a quorum of data nodes that report the same current configuration. When a set of data nodes identifying the same group of data nodes as storing the current configuration represents a quorum of that group, the set can reliably be determined as the set of data nodes making up the current configuration. Messages can be processed for each partition of the data set stored in the data nodes, allowing the configuration manager to be rebuilt with configuration information identifying the nodes storing a current replica of each partition, including a primary node for the partition.

The foregoing is a non-limiting summary of the invention, which is defined by the attached claims.

BRIEF DESCRIPTION OF THE DRAWINGS

The accompanying drawings are not intended to be drawn to scale. In the drawings, each identical or nearly identical component that is illustrated in various figures is represented by a like numeral. For purposes of clarity, not every component may be labeled in every drawing. In the drawings:

FIG. 1 is a block diagram of a reconstruction component according to some embodiments of the invention;

FIG. 2 illustrates an exemplary partitioning and a rebuild associated with a plurality of nodes according to some embodiments of the invention;

FIG. 3 is a block diagram of a system with a configuration component that can be reconstructed according to some embodiments of the invention;

FIG. 4A illustrates a methodology of setting a new configuration according to some embodiments of the invention;

FIG. 4B is a flow diagram of a method for managing a distributed system using a master cluster according to some embodiments of the invention;

FIG. 4C is a flow diagram of a method for rebuilding configuration information for a partition of the database according to some embodiments of the invention;

FIG. 5 illustrates an exemplary environment for implementing various aspects of some embodiments of the invention; and

FIG. 6 is a schematic block diagram of a sample computing environment that can be employed for data retrieval according to some embodiments of the invention.

DETAILED DESCRIPTION

The inventors have recognized and appreciated that improvements in cost and reliability of distributed database systems may be achieved through an improved configuration manager that maintains configuration information for a distributed data store.

The inventors have further recognized and appreciated that distributed systems frequently have a need to offer high availability of the data, even as the underlying computing machines used to implement the distributed system may themselves occasionally fail. This applies not only to the transactional data maintained in partitions by data nodes (also referred to as replica nodes) on the distributed system, but also to configuration information stored on master nodes, which relates the partitions of the data store to the data nodes on which the data is replicated.

Accordingly, in some embodiments of the invention, the distributed system has multiple data nodes for storing data and multiple master nodes for storing configuration information. Data may be stored in partitions, each of which may be replicated by a set of data nodes within the distributed system. Even though the data nodes replicating the partition are unreliable, transactional consistency is assured using quorum-based processing. If a quorum of the data nodes in a current configuration agree on the current configuration, a data node that is part of that quorum can provide a reliable copy of the data for the partition.

Each partition may be periodically reconfigured to utilize a different set of data nodes or change the partition's primary node. Reconfiguration may be done, for example, in response to changes in the distributed system such as the loss or addition of data nodes.

To facilitate quorum-based identification of the data nodes in a partition at any given time, operations that establish or reconfigure the current configuration may also be implemented using quorum-based processing. When a new configuration for a partition is to be established, the data nodes in that new configuration do not “commit” the activation command until a quorum of the nodes in the new configuration respond with an indication that they are able to commit the command. Similarly, when a current configuration is to be deactivated, the nodes in the current configuration do not commit the deactivate command until a quorum of nodes in the current configuration respond that they can commit the deactivate command. In this way, when a reconfiguration occurs, there will be a quorum of nodes with the new configuration and not a quorum of nodes with the old configuration.
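
The activation rule described above can be illustrated with a minimal sketch. It assumes that a quorum is a simple majority of the nodes in the new configuration (the description does not fix the quorum size), and the names `majority`, `try_activate`, and `can_commit` are hypothetical and introduced only for illustration.

```python
from typing import Callable, Iterable


def majority(n: int) -> int:
    """Smallest count that is more than half of n (assumed quorum size)."""
    return n // 2 + 1


def try_activate(new_config: Iterable[str],
                 can_commit: Callable[[str], bool]) -> bool:
    """Return True only if a quorum of the new configuration can commit.

    `can_commit(node)` is a hypothetical stand-in for asking a node whether
    it is able to commit the activation command; unreachable nodes are
    modeled as answering False.
    """
    nodes = list(new_config)
    acks = sum(1 for node in nodes if can_commit(node))
    return acks >= majority(len(nodes))


# Example: node C is down, but A and B still form a quorum of {A, B, C}.
reachable = {"A", "B"}
activated = try_activate(["A", "B", "C"], lambda n: n in reachable)
```

The same check, applied to the nodes of the current configuration, would gate the deactivate command.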

Regardless of the process by which the nodes are reconfigured, a configuration manager may store configuration information for the partitions of data nodes. Additionally, the configuration manager may execute programs that select a set of data nodes to store each partition. Though, the manner in which sets of data nodes are selected to store a partition is not critical to the invention and any suitable mechanism may be used. The configuration manager may also be implemented as a set of nodes, in exemplary embodiments referred to herein as master nodes.

The set of master nodes, also referred to as the master cluster, maintains configuration information that identifies the data nodes storing replicas of each partition and other information that may be used in accessing that data, such as an identity of the primary node in each partition. The set of master nodes constituting the master cluster may also change from time to time. Quorum-based processing may also be used to identify the current configuration of the master cluster. As with data nodes, changes to the current configuration of the master cluster may be performed with quorum-based operations.

In some embodiments, the configuration information in the master cluster may be stored as a database mapping the partitions to the data nodes on which they are replicated. In some exemplary embodiments described herein, such a database is described as a global partition map (GPM). The GPM may further include information about the status of each partition, such as which nodes are alive, which node is the primary node for each partition, the configuration version of each partition and whether the node is currently involved in a process of reconfiguring the nodes on which the partition is replicated.
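
As an illustration only, the kinds of information listed above could be held in a structure like the following. The field names and the `PartitionEntry`, `GlobalPartitionMap`, and `primary_for` identifiers are assumptions made for this sketch and are not taken from the described embodiments.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class PartitionEntry:
    """One row of a hypothetical global partition map (GPM)."""
    replicas: Set[str]            # data nodes replicating the partition
    primary: str                  # primary node for the partition
    config_version: int           # configuration version of the partition
    alive: Set[str] = field(default_factory=set)  # replicas believed alive
    reconfiguring: bool = False   # True while a reconfiguration is in progress


# The GPM maps a partition identifier to its entry.
GlobalPartitionMap = Dict[str, PartitionEntry]

gpm: GlobalPartitionMap = {
    "X": PartitionEntry(replicas={"A", "B", "C"}, primary="A",
                        config_version=7, alive={"A", "B"}),
}


def primary_for(gpm: GlobalPartitionMap, partition: str) -> str:
    """Look up the primary node a client should contact for a partition."""
    return gpm[partition].primary
```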

The GPM may be treated as a partition stored on the master cluster. One of the nodes in the master cluster may be designated as the primary node for the GPM partition. Master nodes, like the data nodes of the distributed system in general, may be individually unreliable and occasionally fail. So long as a quorum of master nodes agrees on the current configuration of the master cluster, any master node within that quorum can provide reliable information on the GPM.

However, hardware failures or other events may cause a loss of integrity of the master cluster. In response, the master cluster may be rebuilt to restore that integrity. When the integrity of the master cluster is lost, the master cluster may be rebuilt, including regenerating the GPM, from information stored by the data nodes of the distributed system.

The master cluster may be rebuilt in response to a triggering event, such as when the primary master node is lost or a quorum of the master nodes cannot be accessed to verify that a particular master node from which a GPM is available contains an accurate replica of the current GPM. In some embodiments, when a replica for any partition sends a message, it includes a configuration version for the partition which can be cross checked with the GPM. An inconsistency between the GPM and the configuration version indicated by the message may also trigger reconfiguration of the master cluster. Though, the specific events that are regarded as triggering events are not critical to the invention. For example, in some embodiments, loss of the primary node may not necessarily trigger rebuilding of the master cluster. If a quorum of master nodes in the current configuration is available, even though the primary node is not, it may be possible to replace the primary master node with another node that contains a replica of the current configuration. Accordingly, it should be appreciated that the trigger events described herein are exemplary and different or additional events may trigger a rebuild.

Regardless of the conditions under which a rebuild is to be initiated, a rebuild may entail erasing from all of the master nodes the current configuration information and regenerating that information based on messages received from data nodes. A new primary master node also may be selected as part of the rebuild. Other master nodes may be designated as secondary master nodes in the new configuration of the master cluster and replicas of the current configuration information, derived from the messages from the data nodes, can be stored in both the primary and secondary master nodes.

In some embodiments, the selection of secondary nodes may be made by programming on the primary master node. Additionally, the primary master node may collect and process messages from the data nodes to derive the current GPM. Though, in other embodiments, an external component may operate as a configuration controller that designates the primary and secondary nodes and collects messages from the data nodes.

Selection of a primary master node may entail considerations that are different than for the selection of secondary master nodes. In the embodiments described, processing is employed such that a single master node is designated as the primary master node and all other master nodes recognize that master node as the primary. In some embodiments, such processing may entail configuring the master nodes in a token ring. The master nodes in the token ring may have an order, such as based on the order in which they are passed the token around the ring. Based on this order, a master node at a predetermined location in the ring may be designated as the new primary master node, allowing a master node to be uniquely identified. In some embodiments, the new primary master node is selected as the master node with a token value of 0. However, any suitable mechanism may be used to uniquely identify a master node in the token ring. Also, any other suitable approach, whether or not a token ring is established, may be used to uniquely identify a master node as the primary master node in the new master cluster.
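
A minimal sketch of the ring-position approach follows. It assumes the order in which the token visited the master nodes is already known and simply numbers the nodes in that order; how the ring is formed and how the token is exchanged are outside the sketch, and the function and node names are hypothetical.

```python
from typing import Dict, List, Optional


def assign_positions(token_order: List[str]) -> Dict[str, int]:
    """Give each master node the position at which it received the token."""
    return {node: pos for pos, node in enumerate(token_order)}


def select_primary(positions: Dict[str, int]) -> Optional[str]:
    """Return the node holding position 0, or None if the ring is empty."""
    for node, pos in positions.items():
        if pos == 0:
            return node
    return None


# Hypothetical order in which the token visited three master nodes.
positions = assign_positions(["m2", "m3", "m1"])
primary = select_primary(positions)
```

Because every node derives the same positions from the same token order, every node arrives at the same primary without further negotiation.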

Before rebuilding the GPM, any existing data related to the map may be deleted by members of the master cluster. This process may be performed by deleting the GPM from all the master nodes of the prior master cluster and/or all the master nodes to make up the new cluster or all of the master nodes, or in any other suitable way. To rebuild the GPM, the nodes in the distributed system may each provide a message to one or more of the master nodes (e.g., the primary master node) indicating information from which the master nodes can reconstruct the GPM, such as the partition replicated by the node, a configuration version of the partition, and the set of data nodes for the partition. The messages sent by the nodes to the master cluster may be automatically sent on a periodic basis, sent in response to a request from the master cluster or other device acting as a reconfiguration controller, or sent as part of a system reset. Though, any suitable mechanism may trigger the nodes to send the reporting message to the master cluster. In some embodiments, the messages may be generated by the nodes using their own respective local partition maps. If a data node replicates more than one partition, the node may provide the above information for each partition.
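
By way of illustration, a reporting message carrying the three items named above (partition, configuration version, and set of data nodes) might be generated from a local partition map as sketched below. The `Report` type, the shape of the local partition map, and `build_reports` are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Tuple


@dataclass(frozen=True)
class Report:
    """Hypothetical reporting message sent by a data node for one partition."""
    sender: str                 # reporting data node
    partition: str              # partition replicated by the sender
    config_version: int         # configuration version known to the sender
    members: FrozenSet[str]     # data nodes the sender believes hold the partition


# A local partition map is modeled as partition -> (version, member set).
LocalPartitionMap = Dict[str, Tuple[int, FrozenSet[str]]]


def build_reports(node_id: str, local_map: LocalPartitionMap) -> List[Report]:
    """Produce one report per partition held in the node's local partition map."""
    return [
        Report(node_id, partition, version, members)
        for partition, (version, members) in sorted(local_map.items())
    ]


# Node A replicates two partitions, so it sends two reports.
reports = build_reports("A", {
    "X": (7, frozenset({"A", "B", "C"})),
    "Y": (3, frozenset({"A", "D"})),
})
```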

The messages from the data nodes are received by the master cluster (e.g., the primary master node) and processed to identify a current version of the configuration for each partition. The configuration version of a partition may be identified when a quorum of the data nodes identifying themselves as part of the current configuration agree upon the configuration version. If a quorum is achieved for multiple configuration versions of the same partition, the more recent configuration version is activated in the GPM. In some embodiments, the more recent configuration version will be identified as the configuration version with the highest numerical representation.
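
The processing described above can be sketched as follows, under stated assumptions: a report is modeled as a tuple of sender, partition, configuration version, and member set; a quorum is taken to be a simple majority of the reported member set; and the function name `rebuild_gpm` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Iterable, Set, Tuple

# A report is (sender, partition, config_version, members); hypothetical format.
Report = Tuple[str, str, int, FrozenSet[str]]


def rebuild_gpm(reports: Iterable[Report]) -> Dict[str, Tuple[int, FrozenSet[str]]]:
    """Map each partition to the newest configuration confirmed by a quorum."""
    # (partition, version, members) -> senders that reported it and belong to it
    votes: Dict[Tuple[str, int, FrozenSet[str]], Set[str]] = defaultdict(set)
    for sender, partition, version, members in reports:
        if sender in members:  # only members of the configuration count
            votes[(partition, version, members)].add(sender)

    gpm: Dict[str, Tuple[int, FrozenSet[str]]] = {}
    for (partition, version, members), senders in votes.items():
        quorum = len(members) // 2 + 1  # assumed: simple majority
        if len(senders) < quorum:
            continue
        # If several versions reach quorum, keep the highest version number.
        if partition not in gpm or version > gpm[partition][0]:
            gpm[partition] = (version, members)
    return gpm


old = frozenset({"A", "B", "C"})
new = frozenset({"B", "C", "D"})
gpm = rebuild_gpm([
    ("A", "X", 1, old), ("B", "X", 1, old),   # quorum for version 1
    ("C", "X", 2, new), ("D", "X", 2, new),   # quorum for version 2
])
```

In this contrived input both versions reach a quorum, so the entry for partition X records version 2, the higher of the two.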

In some embodiments, data/information related to reconfiguration of nodes (the nodes being associated with a distributed system that implements dynamic quorums of read/write conditions) is reconstructed via a reconstruction component. In one aspect, the reconstruction component enables storing and replicating partial copies of the information across the distributed system itself. Such distributed segments can then be employed to reconstruct content of the central management system in a consistent manner. Accordingly, the reconstruction component can reconstruct the central management component contents, including the global partition map, from various locations on the system, wherein the central management component/configuration component can be treated as a cache. Moreover, scalability can be provided via protocol partitioning of the central management component (e.g., employing a same protocol as employed to make other parts of the system highly available). Likewise, employing a central management component for leadership election for the rest of the system allows for flexibility and scale (typically not afforded if using a conventional consensus-based leadership election algorithm).

In a related aspect, the configuration manager component can be replicated to a number of master machines that form the master cluster. Each of these nodes can interact with a respective reconfiguration agent with which the local instance of the Configuration Manager interacts. Moreover, the primary reconfiguration agent for the master cluster can be selected by a reliable consensus algorithm, which can be provided by the communication layer, and the old and new configuration membership sets are determined by system configuration.

Accordingly, the reconstruction component can replicate the configuration manager component, and hence enable the configuration manager component to be readily available even in the loss of less than a quorum of master cluster machines. Put differently, the subject innovation enables restoration of the configuration manager component contents from various portions of the distributed system of nodes.

In a related aspect, partition related information can be restored from the replicas that are part of the more recent configuration for that partition. As part of the reconfiguration algorithm, each replica stores its local view of what is the latest, or latest proposed, configuration for the partition. Since a configuration becomes active when a write quorum of replicas accept the new configuration, the subject innovation can determine which configuration is the most recent by identifying a configuration where a write quorum of replicas report that particular configuration as the latest. (This configuration is typically guaranteed to be the latest, assuming nodes cannot be rolled back in time, because there can only exist one such configuration since the current configuration must be deactivated before a new configuration is activated. The deactivation of the current/old configuration effectively destroys that configuration's ability to form a quorum.)

According to a further methodology, when a catastrophic loss on the master cluster is detected, the system initiates a configuration manager rebuild by initially destroying any partial information left on the master cluster machines (since some machines can actually survive). The methodology subsequently requests each machine in the cluster/configuration of nodes to send its respective most current (e.g., latest) configurations for the partitions of which they hold replicas, wherein the configuration manager component receives such status messages. Each of the messages enables the configuration manager component to learn about partitions that existed in the system, the replicas on a particular machine, replicas on other machines that were known to the reporting replica, and machines known to the reporting machine that may not have reported their status. The configuration manager component can render a partition active again when it has received a write quorum of messages where the replicas for the partition report the same latest configuration, wherein such quorum depends on the configuration itself. Hence, as long as a write quorum of replicas for the latest configuration of a partition report and there was no reconfiguration active during the catastrophic loss, the system can ensure an automatic recovery of the partition. Likewise, if a reconfiguration was active up to a read quorum of the old configuration, then a write quorum of the new configuration can typically be required to ensure accurate restoration (although fewer reports may suffice depending on the phase of the reconfiguration).

FIG. 1 illustrates a block diagram for a configuration manager 100 that employs a reconstruction component 101, which enables reconstructing information related to reconfiguring members of a distributed system. Such reconstruction component 101 can further be associated with a leader elector component 102 and a cluster configuration component 103, which can facilitate designation/operations associated with a primary (e.g., active) configuration manager instance/components. In one aspect, the reconstruction component 101 enables replicating partial copies of the information across the distributed system itself. Such distributed segments/pieces can then be employed to reconstruct contents of the central management system in a consistent manner. Accordingly, the reconstruction component 101 can reconstruct central management component contents from various locations on the system, wherein the central management component/configuration component can be treated as a cache. Moreover, scalability can be provided via protocol partitioning of the central management component (e.g., using a same protocol as employed to make other parts of the system highly available). In addition, employing a central management component for leadership election for the rest of the system allows for flexibility and scale, which is typically not afforded if using a conventional consensus-based leadership election algorithm.

Reconstruction component 101 may be implemented in any suitable way. In some embodiments, reconstruction component 101 may be in a computer device coupled to master nodes 110₁, 110₂ and 110₃ over a network. Such a computer device may be programmed with computer-executable instructions that monitor for events, as described above, that may trigger a reconstruction of the configuration manager. When such an event is detected, reconstruction component 101 may also issue commands and receive responses that control the reconstruction process.

In some embodiments, reconstruction component 101 may additionally perform functions that control the primary nodes to establish that at least a subset of the available master nodes is configured to replicate a current version of the configuration information held within configuration manager 100. However, such control functions may alternatively or additionally be implemented in any suitable components.

In the embodiment illustrated, reconstruction component 101 is shown as a component separate from each of the master nodes. Though, it should be appreciated that reconstruction component 101 may be implemented in any suitable hardware, including in a primary master node.

FIG. 1 illustrates that configuration manager 100 is distributed across multiple master nodes. Here three master nodes, 110₁, 110₂ and 110₃, are shown. However, any suitable number of master nodes may be employed in a system, some or all of which may be configured at any given time to constitute a configuration manager.

In the embodiment illustrated, each of the master nodes 110₁, 110₂ and 110₃ is shown to be implemented with the same hardware. Such a configuration is provided for simplicity of illustration and each master node may be implemented with any suitable hardware or hardware components. However, taking master node 110₃ as illustrative, each master node may contain a data store 112, implemented in any suitable computer storage media, in which configuration information may be stored. Additionally, a master node may contain a reconfiguration agent 114 and a configuration manager component 116. In some embodiments, reconfiguration agent 114 and configuration manager component 116 may be implemented as computer executable instructions executed on a processor, such as may exist in a server or other computer device hosting a master node.

In operation, configuration manager component 116 may manage the configurations of the data nodes in a distributed database to which configuration manager 100 is coupled via a network. Management operations may include tracking active nodes in a partition to ascertain the number of active data nodes replicating the partition and adding data nodes to a configuration if there are an insufficient number of data nodes. In addition, configuration manager component 116 may perform other actions related to managing the partition, including providing other components accessing the database with information on data nodes from which data in one or more partitions can be obtained. Configuration manager component 116 may also perform other actions associated with a configuration manager as is known in the art or any other suitable actions.

In operation, reconfiguration agent 114 may interact with similar reconfiguration agents in other master nodes to ensure that each master node in a master cluster maintains a consistent replica of the configuration information. For example, when a change is made to information on one node, the reconfiguration agent on that node may distribute change information to reconfiguration agents on other nodes. However, it should be recognized that functions of a master node need not be implemented in two components as shown. All functions may be implemented in a single component or in more than two components.

As noted above, at any given time, one of the master nodes may be designated as the primary master node. The primary node may perform all control functions of the configuration manager and initiate all changes to the configuration information stored in the configuration manager. Other master nodes in the current configuration may receive such changes and make corresponding changes to maintain a consistent replica. In the embodiment illustrated, master node 110₂ is the current primary node.

A master node may be selected to act as the primary node in any suitable way. In some embodiments, the master node is designated by a network administrator. Though, as described in connection with FIG. 3, below, an automated technique for selecting a primary master node may also be employed.

FIG. 2 illustrates a block diagram for a system 200 in which a configuration manager can be reconstructed according to an exemplary aspect. As illustrated in FIG. 2, each of the data nodes stores information about a configuration to which it has been assigned. At the time a data node is assigned to a configuration and receives a current copy of data being maintained by the distributed system, the information stored in that data node is up-to-date. The data in each data node may represent a partition of a database. In some embodiments, a database may contain a single partition such that each data node that is part of the current configuration contains a full copy of the database. In other embodiments, though, the database may contain multiple partitions and each data node may store only a subset of the database.

Regardless of how much of the database is stored in an active node, over time, due to hardware failures or other causes, one or more data nodes may not receive updates to the replicated data or the configuration. Accordingly, though the information stored in the data node itself may indicate that the node is up-to-date, that information may actually be incorrect. Accordingly, a quorum-based approach may be used for identifying data nodes that agree on the current configuration of the database. FIG. 2 provides an example of a manner in which quorum-based processing may be used to identify a current configuration based on information read from multiple nodes of the distributed system. Though, it should be appreciated that this information need not be read in response to a command initiated by a configuration manager, reconstruction component or other component. In some embodiments, this information is provided from the data nodes in response to a system restart or other event.

In the example shown in FIG. 2, for partition X of data (e.g., a segment/replica of data), configuration M consists of data node D and data node E, and yet, as illustrated, only data node D has reported such configuration. Likewise, configuration N consists of data nodes A, B, and C, wherein A, B, and E have reported such configuration. It is noted that data node E does not count in this scenario, as this node is not part of such configuration; but A and B still form a write quorum (2 out of 3). Hence, configuration N should in fact represent the latest configuration.

The configuration version and data node information for the latest configuration version are shown recorded as a global partition map in the configuration manager. This configuration information could have been stored in the configuration manager as the data nodes were configured. However, as illustrated in FIG. 2, this configuration information may be derived from messages sent by the data nodes, each identifying the information it has stored indicating the current configuration for each partition for which data is stored on the data node. In this way, the configuration information can be recreated based on messages from the data nodes.
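The FIG. 2 example above can be sketched in a few lines of Python. This is a minimal illustration only, not the claimed implementation: the `Report` type, the function names, the use of a simple majority as the write quorum, and the version numbers 1 and 2 assigned to configurations M and N are all assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Report:
    sender: str          # data node that sent the report
    version: int         # configuration version the sender believes is current
    members: frozenset   # data nodes the sender lists for that configuration


def write_quorum_size(n: int) -> int:
    # Assumption: a write quorum is a simple majority of the configuration.
    return n // 2 + 1


def latest_configuration(reports):
    """Return (version, members) for the highest-versioned configuration that
    a write quorum of its own members has reported, or None if there is none."""
    supporters = {}
    for r in reports:
        # A report from a node that is not part of the configuration it
        # names does not count (data node E in the FIG. 2 example).
        if r.sender in r.members:
            supporters.setdefault((r.version, r.members), set()).add(r.sender)
    confirmed = [
        key for key, nodes in supporters.items()
        if len(nodes) >= write_quorum_size(len(key[1]))
    ]
    return max(confirmed, key=lambda k: k[0], default=None)


# Hypothetical encoding of FIG. 2: M = {D, E}, N = {A, B, C}.
M = frozenset({"D", "E"})
N = frozenset({"A", "B", "C"})
reports = [Report("D", 1, M), Report("A", 2, N),
           Report("B", 2, N), Report("E", 2, N)]
result = latest_configuration(reports)
```

Here `result` names configuration N: only A and B are counted for N, which is still 2 out of 3, while D alone does not form a quorum of M.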

FIG. 3 illustrates an approach by which a set of nodes can be organized to uniquely identify a node as a primary node. Such an approach may be used to automatically identify a master node to act as a primary master node.

FIG. 3 is a block diagram of a system 300 that implements a configuration manager component 302 in conjunction with a plurality of nodes as part of a distributed environment such as a ring 310, which can be reconstructed in accordance with an aspect of the subject innovation. The configuration manager component 302 can reconfigure members of a distributed system of nodes (e.g., servers) from an old configuration to a new configuration, in a transactionally consistent manner, by implementing dynamic quorum-based read/write conditions, which mitigate data loss during such transformation. Such a quorum can represent a predetermined number, wherein the sum of the read quorum and the write quorum exceeds the number of nodes for the configuration (e.g., the read and write quorums of a given configuration overlap). Though, similar processing may be used to create a new configuration, even without an old configuration, and may be used, for example, if a catastrophic failure has created a need to reconstruct the configuration manager.
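The overlap condition stated above, namely that the read quorum plus the write quorum exceeds the number of nodes, can be checked directly. The following sketch is illustrative only; the function names are hypothetical, and the brute-force check is included merely to show why the arithmetic condition guarantees that every read quorum shares at least one node with every write quorum.

```python
from itertools import combinations


def quorums_overlap(n_nodes: int, read_quorum: int, write_quorum: int) -> bool:
    # Arithmetic form of the condition: R + W > N.
    return read_quorum + write_quorum > n_nodes


def every_pair_intersects(nodes, read_quorum: int, write_quorum: int) -> bool:
    # Brute-force form: every possible read set shares a node with every
    # possible write set. Only practical for small configurations.
    return all(
        set(rs) & set(ws)
        for rs in combinations(nodes, read_quorum)
        for ws in combinations(nodes, write_quorum)
    )


nodes = ["A", "B", "C"]
ok = quorums_overlap(3, 2, 2) and every_pair_intersects(nodes, 2, 2)
bad = quorums_overlap(3, 1, 2) or every_pair_intersects(nodes, 1, 2)
```

With three nodes, quorums of 2 and 2 overlap, whereas a read quorum of 1 and a write quorum of 2 do not, since a read of only A could miss a write to B and C.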

As illustrated in FIG. 3, in general, when a first node N₁ 301 comes up in a ring 310, it can create a token that covers the entire number space; this can be referred to as the initial token creation. Subsequently, a token can ideally only be transferred among the nodes (N₁ to N_(m), where m is an integer), so that typically no two nodes can have overlapping tokens at any time. For example, in a simplest form, an administrator can explicitly indicate whether a node is a first node.

After the initial creation of the token, such a token needs to be split whenever a new node joins the ring, and requires a merger when an existing node leaves the ring and therefore gives up its token to some other node(s). Typically, the ring 310 is associated with a federation that can consist of a set of nodes that cooperate among themselves to form a dynamic and scalable network, wherein information can be systematically and efficiently disseminated and located. Moreover, the nodes participating in a federation can be represented as a sorted list using a binary relation that is reflexive, anti-symmetric, transitive, total, and defined over the domain of node identities. For example, both ends of the sorted list can be joined, thereby forming a ring 310. This allows each node in the list to view itself as being at the middle of the sorted list. In a related aspect, the list can be doubly linked such that a node can traverse the list in either direction. Moreover, a one-to-one mapping function can be defined from the value domain of the node identities to the nodes themselves. Such a mapping function accounts for the sparseness of the nodes in the value domain when the mapping is not tight.

As such, every node participating in the federation is assigned a natural number that is between 0 and some appropriately chosen upper bound, inclusive, and the assigned numbers do not have to be consecutive (e.g., there can exist gaps between numbers assigned to nodes). The number assigned to a node acts as its identity in the ring. The mapping function accounts for gaps in the number space by mapping a number positioned between two node identities to the node having an identity that is numerically closest to the number. Accordingly, by assigning each node a uniformly distributed number, it can be ensured that all segments of the ring are uniformly populated. Moreover, and as described in detail infra, the successor, predecessor, and neighborhood computations can be performed efficiently using modulo arithmetic.
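The mapping and neighbor computations described above can be illustrated with modulo arithmetic as follows. This is a sketch under stated assumptions: the function names are hypothetical, `size` is taken to be the upper bound plus one, and ties between two equally close nodes are broken toward the smaller identity, a rule the description does not specify.

```python
def ring_distance(a: int, b: int, size: int) -> int:
    # Shortest distance between two IDs on a ring of `size` positions.
    d = (a - b) % size
    return min(d, size - d)


def closest_node(key: int, node_ids, size: int) -> int:
    # Map a number to the node whose identity is numerically closest on the
    # ring. Tie-break toward the smaller ID (an assumption made here).
    return min(node_ids, key=lambda n: (ring_distance(key, n, size), n))


def successor(node: int, node_ids, size: int) -> int:
    # Next node in the direction of increasing IDs, wrapping around the ring.
    others = [n for n in node_ids if n != node] or [node]
    return min(others, key=lambda n: (n - node) % size)


def predecessor(node: int, node_ids, size: int) -> int:
    # Next node in the direction of decreasing IDs, wrapping around the ring.
    others = [n for n in node_ids if n != node] or [node]
    return min(others, key=lambda n: (node - n) % size)


SIZE = 256                 # hypothetical number space 0..255
NODES = [10, 100, 200]     # hypothetical node identities with gaps
owner_of_250 = closest_node(250, NODES, SIZE)
owner_of_60 = closest_node(60, NODES, SIZE)
```

In this example the number 250 maps to node 10, because the ring wraps around and 10 is only 16 positions away, while 60 maps to node 100.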

In such an arrangement, routing consistency can be achieved via assignment and ownership of tokens. Typically, a node can accept a message only when it has an ownership token on the ID to which the message is destined. As explained above, a token contains a consecutive range of IDs and every token has an owner. A token in transit is considered not to exist until it is accepted by a node. Moreover, the ranges of any two tokens must in general be disjoint. A token can be split into two adjacent tokens, and two or more adjacent tokens can be merged into a single token. Additionally, a node must typically own a token that includes at least its own ID. A node owning a token is referred to as being in the routing stage and can also be referred to as a routing node. A routing node owns only a single token, or a single range of IDs, for example. Eventually, the token for an ID will be owned by a routing node that is closest to that ID (e.g., the liveness property). Token transfer should be synchronized with the transfer of data that is stored at any ID in the range of the token. More precisely, token transfer can typically occur only after data transfer is completed.
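The token rules above (consecutive ranges, splitting into adjacent tokens, merging adjacent tokens, and accepting a message only with a covering token) can be sketched as follows. The `Token` type and function names are hypothetical, and for simplicity the sketch assumes ranges that do not wrap around the end of the number space.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Token:
    lo: int   # first ID covered (inclusive)
    hi: int   # last ID covered (inclusive)

    def covers(self, ident: int) -> bool:
        return self.lo <= ident <= self.hi


def split(token: Token, at: int):
    """Split a token into two adjacent tokens [lo, at-1] and [at, hi]."""
    if not (token.lo < at <= token.hi):
        raise ValueError("split point must fall inside the token range")
    return Token(token.lo, at - 1), Token(at, token.hi)


def merge(left: Token, right: Token) -> Token:
    """Merge two adjacent tokens into one; non-adjacent tokens are rejected."""
    if left.hi + 1 != right.lo:
        raise ValueError("tokens are not adjacent")
    return Token(left.lo, right.hi)


def can_accept(token: Optional[Token], dest_id: int) -> bool:
    # A node without a token, or with a token that does not cover the
    # destination ID, does not accept the message.
    return token is not None and token.covers(dest_id)


initial = Token(0, 255)            # first node: token covers the whole space
low, high = split(initial, 128)    # a second node joins and takes the upper half
```

Because `split` always yields two adjacent, non-overlapping ranges and `merge` only accepts adjacent ranges, the disjointness of token ranges is preserved by construction in this sketch.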

The interactions described above associated with organizing nodes into a ring as illustrated in FIG. 3 may be performed by any suitable components. In some embodiments, messages may be sent and received under control of the available master nodes in a system. In other embodiments, the interactions may be performed under control of an interconnection fabric, implemented by components that interconnect the master nodes in a network.

FIG. 4A illustrates a related methodology 400 for various stages of configuring a network of nodes. The process may be employed to configure data nodes storing a partition of a database. Though, a similar process may be used to configure master nodes into a master cluster.

Each partition of data in the distributed system is stored on a set of data nodes. One of the data nodes may be designated as the primary replica for the partition. The remaining data nodes for the partition may be designated as secondary replicas. Upon receipt of a reconfiguration request, a reconfiguration agent on the primary replica can initiate deactivation of an old or existing configuration and subsequent activation of the new configuration (e.g., ensuring that any transactions whose commits were acknowledged to the client will be retained by the new configuration, while transactions which had not committed or whose commit had not been acknowledged can either be committed or rolled back). This can include implementation of four phases, namely:

Phase 1: Ballot and Catch-up at 410

During this phase the primary replica of the partition proposes a globally unique ID for the new configuration of the partition. Upon acceptance by a quorum of replicas of both the old and new configurations, such ID is guaranteed to be greater than any previously accepted ID for this replication unit. The proposed ID is sent to all replicas in both the old and new configurations, each of which accepts or rejects the ID based on whether it is greater than any ID it has observed previously. Accordingly, if a replica accepts such ID, it can further notify the primary replica of its latest transaction sequence number and halt acceptance of new transactions.

Alternatively, if a replica rejects the proposed ID, the primary picks a new, higher ID and restarts Phase 1. Once a quorum of replicas from both the old and new configurations has accepted the proposed ID, the primary directs the replicas in the new configuration to start catching up so that the transactional consistency and data safety requirements are maintained across the reconfiguration. This can involve a mixture of catch-up and transaction rollbacks on individual replicas. Moreover, the process is guaranteed to result in a quorum of replicas agreeing on the current state for the content and provides Atomicity, Consistency, Isolation, Durability (ACID) properties across the reconfiguration. Phase 1 is complete once at least a quorum of replicas in the new configuration has been caught up.

Phase 2: Deactivation of Old Configuration at 420

During this phase the primary replica coordinates the deactivation of the old configuration. The purpose of deactivation is to guarantee that it is never possible to find two sets of replicas R1 and R2 such that R1≠R2, each replica r1 in R1 claims that configuration C1 is the latest configuration and R1 forms a write quorum of C1, and each replica r2 in R2 claims that configuration C2 is the latest configuration and R2 forms a write quorum of C2, unless C1=C2. Moreover, a deactivation message can be sent to each replica in the old configuration. Each of the replicas can accept the deactivation if it matches the latest ballot proposal it has accepted. This phase is complete when a read quorum of replicas acknowledges the deactivation.

Phase 3: Activation of New Configuration, at 430

During this phase the primary replica coordinates the activation of the new configuration. A purpose of activation is to guarantee that a write quorum of the new configuration knows that the configuration has been activated before changes to the content of the replication unit are allowed. This can ensure that any content changes can be lost only if a quorum of nodes is lost. The activation message can further be sent to each replica in the new configuration. Each of these replicas can accept the activation if it matches the latest ballot proposal it has accepted. This phase is complete when a write quorum of replicas in the new configuration has accepted the activation. At this point the new configuration is active and usable.

Phase 4: Commit at 440

This stage is an optional phase for committing the reconfiguration, since at the end of Phase 3 the old configuration has been deactivated and the new configuration has been activated. Yet, this is known only to the primary replica and from a global outside-of-system perspective. Accordingly, the commit phase distributes this knowledge to all interested parties in the distributed system, namely to each replica in the old and new configurations as well as the configuration manager.
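The ballot rule of Phase 1 and the "matches the latest accepted ballot" check used in Phases 2 and 3 can be sketched as below. This is a simplified, single-process illustration, not the described protocol in full: the `Replica` class, its field names, the `propose` helper, and the use of a simple majority as the quorum are assumptions made here, and catch-up, messaging, and failure handling are omitted.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class Replica:
    last_txn_seq: int = 0                  # latest transaction sequence number
    highest_seen: int = 0                  # highest ballot ID observed so far
    accepted_ballot: Optional[int] = None  # latest ballot this replica accepted
    accepting_txns: bool = True

    def on_proposal(self, ballot_id: int) -> Tuple[bool, Optional[int]]:
        # Reject unless the proposed ID is greater than any ID seen before.
        if ballot_id <= self.highest_seen:
            return (False, None)
        self.highest_seen = ballot_id
        self.accepted_ballot = ballot_id
        self.accepting_txns = False        # halt acceptance of new transactions
        return (True, self.last_txn_seq)   # report latest sequence number

    def matches_latest_ballot(self, ballot_id: int) -> bool:
        # Deactivation and activation are accepted only for the latest ballot.
        return ballot_id == self.accepted_ballot


def propose(ballot_id: int, old: Dict[str, Replica], new: Dict[str, Replica]) -> bool:
    """Send one proposal to every replica of both configurations; succeed only
    if a majority (assumed quorum) of each configuration accepts."""
    replies = {name: r.on_proposal(ballot_id) for name, r in {**old, **new}.items()}

    def has_quorum(cfg: Dict[str, Replica]) -> bool:
        return sum(replies[n][0] for n in cfg) >= len(cfg) // 2 + 1

    return has_quorum(old) and has_quorum(new)


old_cfg = {"A": Replica(), "B": Replica()}
new_cfg = {"B": old_cfg["B"], "C": Replica(highest_seen=9)}
first_try = propose(5, old_cfg, new_cfg)     # C has already seen ballot 9
second_try = propose(10, old_cfg, new_cfg)   # primary retries with a higher ID
```

In this example the first proposal fails to gather a quorum of the new configuration because replica C rejects it, so the primary retries with a higher ID, as in the restart of Phase 1 described above.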

FIG. 4B is a flow diagram of a method 450 for managing a distributed database system. At step 451, a configuration of the database is built. Specifically, the database may be organized as one or more partitions. Each partition of the database is replicated by a set of assigned data nodes. Initial configuration may be performed manually or may be automated in any suitable way. Because the partitions may be reconfigured, a configuration version may be used to identify the current configuration of each partition.

As part of the initial configuration, a set of master nodes forms a master cluster within the distributed system. At step 453, the configuration of the database system is recorded as configuration information by the master nodes of the master cluster. In some embodiments, the configuration information maps each partition to the data nodes on which it is replicated. The configuration information may further include information about the status of each partition, such as which nodes are alive, which node is the primary node for each partition, and the configuration version of each partition. The configuration information may be implemented, for example, as a global partition map.

At step 455, the distributed system receives a request to access data from a partition. The request may, for example, be a request to read data from a partition or write data to a partition. The request may be received, for example, from a client computer wishing to access the database of the distributed system.

To service the request, the distributed system may determine which data node contains data to service the request. If the configuration manager contains a reliable copy of the configuration information, it can determine which data node will service the request from the configuration information stored by the master nodes. At step 457, the distributed system determines whether a quorum of the master nodes exists such that the quorum identifies the same configuration of master nodes as holding the current configuration information for the distributed database.

If it is determined that a quorum exists, the primary node may provide the requested information. Accordingly, method 450 continues to step 459. At step 459, the primary data node for the partition identified by the quorum of master nodes is read. At step 461, the requesting client is provided the data accessed from the primary data node. Similarly, if a write request is made by the client, the requested data to be written is provided to the primary data node.

If, however, at step 457, it is determined that a quorum of the master nodes does not exist, the system may determine to rebuild the master cluster. Processing at step 457 may alternatively or additionally include other processing that may lead to an identification of a trigger condition, such as a catastrophic hardware failure, for rebuilding the configuration manager. In this scenario, processing branches to step 463.
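The decision at step 457 can be sketched as a vote among the master nodes that responded. This is a minimal illustration under assumptions introduced here: the function name and input shape are hypothetical, each responding master is taken to report the set of master nodes it believes holds the current configuration information, and the quorum is taken to be a simple majority of all master nodes.

```python
from collections import Counter
from typing import Dict, FrozenSet, Optional


def agreed_configuration(
    views: Dict[str, FrozenSet[str]], total_masters: int
) -> Optional[FrozenSet[str]]:
    """Return the master configuration named by a quorum of master nodes,
    or None when no quorum agrees (the branch to step 463).

    `views` holds one entry per responding master; failed masters are absent.
    """
    if not views:
        return None
    view, votes = Counter(views.values()).most_common(1)[0]
    return view if votes >= total_masters // 2 + 1 else None


current = frozenset({"m1", "m2", "m3"})
healthy = agreed_configuration({"m1": current, "m2": current}, total_masters=3)
degraded = agreed_configuration({"m1": current}, total_masters=3)
```

With three master nodes, two matching responses are enough to proceed to step 459, whereas a single response yields `None`, which in this sketch corresponds to rebuilding the master cluster.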

At step 463, the master cluster is reset. The reset may entail erasing from all of the master nodes the current configuration information in preparation for rebuilding the configuration manager.

At step 465, a primary master node is selected. In some embodiments, the current primary master node, if alive, is designated as the primary. In other embodiments, processing is used to uniquely identify a master node as the new primary master node. For example, the master nodes may be configured into a token ring as described above in connection with FIG. 3. In such an embodiment, a token is passed from node to node, assigning ordered positions to the master nodes. The new primary master node is selected as the master node with position 0.

At step 467, messages from data nodes are received. In this embodiment, the messages are received at the new primary master node. However, the messages may be received and processed in any suitable component. Each data node may provide a message to the master cluster indicating a configuration of the database. For example, a data node may report to the master cluster the partition or partitions of the database which it replicates and the configuration of each partition. Namely, the data node may specify a configuration version of the partition, an indication of the partition's primary replica, an indication of any secondary replicas for the partition, and a status of the partition. The status may indicate, for example, that the partition is active on the data node or that the data node is part of a new configuration of the partition that has not yet been activated.

Though, it should be appreciated that not every possible data node may send such messages. For example, some subset of the data nodes, such as only active data nodes or only data nodes that store configuration information indicating that the node is part of the current configuration for at least one partition, may send such messages. Moreover, it should be recognized that only a quorum of the data nodes in the current configuration of a partition is required to send messages for the current configuration to be identified. Accordingly, the component receiving the messages at step 467 may collect messages until it receives messages identifying a quorum or may collect messages for some suitable period of time, without waiting to receive a message from every possible data node.

Regardless of how many messages are received, processing may proceed to step 469. At step 469, the configuration information is rebuilt based on information provided from the data nodes. The rebuild process is described with reference to method 470 shown in FIG. 4C. In some embodiments, steps 463, 465, 467, and 469 are performed by a reconstruction component, such as reconstruction component 101 (FIG. 1).

FIG. 4C is a flow diagram of a method 470 for rebuilding configuration information from data nodes in a distributed system. Though FIG. 4C illustrates processing for a single partition, the method 470 may be performed for each partition of the database in the distributed system using the information provided from the data nodes. In this way, configuration information relating to the entire database may be reconstructed.

At step 471, it is determined whether the partition was undergoing reconfiguration at the time that the messages were sent, meaning that the partition was being migrated from one set of data nodes to another. Status information provided by a data node for the partition may be used to determine whether the partition is undergoing reconfiguration. Such processing may be useful, for example, to prevent errors from reconstructing a partition using information that was in an inconsistent state because of a catastrophic error to the configuration manager during the reconfiguration process.

If it is determined at step 471 that the partition is not being reconfigured, method 470 proceeds to step 473 where it is determined if a write quorum of the data nodes for the configuration version of the partition exists. The presence of a write quorum may be determined from the messages reported by the data nodes. If those messages contain a set of messages, sent by different nodes, consistently identifying a set of nodes as the current configuration, that set may possibly be the current configuration. If a quorum of the data nodes identified as being the current configuration send messages indicating that they are active as the current configuration, that set of nodes may be deemed to represent the current configuration. At step 473, the messages received at step 467 may be searched to find a set of messages meeting the criteria for identifying the current configuration.

If a write quorum exists, method 470 continues to step 475 where the current configuration of the partition as verified by the write quorum is written into the configuration manager as the current configuration information.

If, however, a write quorum is not found at step 473, it may not be possible to rebuild the configuration information. Accordingly, an exception condition may be identified, which may be handled in any suitable way. In some embodiments, the processing of FIGS. 4B and 4C for reconstructing configuration information may be performed automatically. However, exception processing may require manual intervention.

Returning to step 471, if it is determined that a reconfiguration of the partition is active, method 470 proceeds to step 477. At step 477, it is determined whether a read quorum of an old configuration of the partition and a write quorum of the new configuration are present. In some embodiments, only the existence of a write quorum of the new configuration is checked at step 477.

If the appropriate quorums exist, the distributed database may be deemed to have been in a consistent state at the time of the event, such as a catastrophic failure of the configuration manager, that triggered the rebuild of the configuration manager. Accordingly, at step 479 the configuration information is updated in the master cluster with the new configuration of the partition as verified by the write quorum of the new configuration. The new quorum may optionally be activated.

Failure to obtain the appropriate quorums at step 477 results in an exception. The exception may indicate, for example, that the distributed database was in an inconsistent state such that manual intervention or other exception processing is required.
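The per-partition flow of method 470 (steps 471 through 479) can be summarized in code. The sketch below is illustrative only: the function and exception names and the argument shapes are hypothetical, both read and write quorums are assumed to be simple majorities, and only reports from nodes that belong to a configuration are counted toward its quorum.

```python
from typing import FrozenSet, Optional


class RebuildError(Exception):
    """Raised when the required quorums are missing (exception processing)."""


def has_quorum(members: FrozenSet[str], reporters: FrozenSet[str]) -> bool:
    # Assumption: read and write quorums are both simple majorities.
    return len(members & reporters) >= len(members) // 2 + 1


def rebuild_partition(
    current: FrozenSet[str],
    reporters: FrozenSet[str],
    old: Optional[FrozenSet[str]] = None,
    old_reporters: FrozenSet[str] = frozenset(),
) -> FrozenSet[str]:
    """Return the node set to record as the partition's current configuration.

    current / reporters: the candidate (new) configuration and the nodes that
    reported it. old / old_reporters: the same for the old configuration;
    `old` is None when step 471 finds no reconfiguration in progress.
    """
    if old is None:                                   # step 471: not reconfiguring
        if has_quorum(current, reporters):            # step 473
            return current                            # step 475
        raise RebuildError("no write quorum for the current configuration")
    # Step 477: read quorum of the old and write quorum of the new configuration.
    if has_quorum(old, old_reporters) and has_quorum(current, reporters):
        return current                                # step 479
    raise RebuildError("partition was in an inconsistent state")


N = frozenset({"A", "B", "C"})
M = frozenset({"D", "E"})
steady = rebuild_partition(N, frozenset({"A", "B", "E"}))
```

A caller would typically run this once per partition and route any `RebuildError` to manual intervention or other exception processing, in line with the handling described above.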

As used in this application, the terms “component” and “system” are intended to refer to a computer-related entity, either hardware, a combination of hardware and software, software, or software in execution. For example, a component can be, but is not limited to being, a process running on a processor, a processor, an object, an executable, a thread of execution, a program, and/or a computer. By way of illustration, both an application running on a server and the server can be a component. One or more components can reside within a process and/or thread of execution, and a component can be localized on one computer and/or distributed between two or more computers.

Furthermore, all or portions of the subject innovation can be implemented as a system, method, apparatus, or article of manufacture using standard programming and/or engineering techniques to produce software, firmware, hardware or any combination thereof to control a computer to implement the disclosed innovation. For example, computer readable media can include but are not limited to magnetic storage devices (e.g., hard disk, floppy disk, magnetic strips . . . ), optical disks (e.g., compact disk (CD), digital versatile disk (DVD) . . . ), smart cards, and flash memory devices (e.g., card, stick, key drive . . . ). Additionally it should be appreciated that a carrier wave can be employed to carry computer-readable electronic data such as those used in transmitting and receiving electronic mail or in accessing a network such as the Internet or a local area network (LAN). Of course, those skilled in the art will recognize many modifications may be made to this configuration without departing from the scope or spirit of the claimed subject matter.

In order to provide a context for the various aspects of the disclosed subject matter, FIGS. 5 and 6 as well as the following discussion are intended to provide a brief, general description of a suitable environment in which the various aspects of the disclosed subject matter may be implemented. While the subject matter has been described above in the general context of computer-executable instructions of a computer program that runs on a computer and/or computers, those skilled in the art will recognize that the innovation also may be implemented in combination with other program modules. Generally, program modules include routines, programs, components, data structures, and the like, which perform particular tasks and/or implement particular abstract data types. Moreover, those skilled in the art will appreciate that the innovative methods can be practiced with other computer system configurations, including single-processor or multiprocessor computer systems, mini-computing devices, mainframe computers, as well as personal computers, hand-held computing devices (e.g., personal digital assistant (PDA), phone, watch . . . ), microprocessor-based or programmable consumer or industrial electronics, and the like. The illustrated aspects may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. However, some, if not all aspects of the innovation can be practiced on stand-alone computers. In a distributed computing environment, program modules may be located in both local and remote memory storage devices.

With reference to FIG. 5, an exemplary environment 510 for implementing various aspects of the subject innovation is described that includes a computer 512. The computer 512 includes a processing unit 514, a system memory 516, and a system bus 518. The system bus 518 couples system components including, but not limited to, the system memory 516 to the processing unit 514. The processing unit 514 can be any of various available processors. Dual microprocessors and other multiprocessor architectures also can be employed as the processing unit 514.

The system bus 518 can be any of several types of bus structure(s) including the memory bus or memory controller, a peripheral bus or external bus, and/or a local bus using any variety of available bus architectures including, but not limited to, 11-bit bus, Industrial Standard Architecture (ISA), Micro-Channel Architecture (MCA), Extended ISA (EISA), Intelligent Drive Electronics (IDE), VESA Local Bus (VLB), Peripheral Component Interconnect (PCI), Universal Serial Bus (USB), Advanced Graphics Port (AGP), Personal Computer Memory Card International Association bus (PCMCIA), and Small Computer Systems Interface (SCSI).

The system memory 516 includes volatile memory 520 and nonvolatile memory 522. The basic input/output system (BIOS), containing the basic routines to transfer information between elements within the computer 512, such as during start-up, is stored in nonvolatile memory 522. For example, nonvolatile memory 522 can include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory 520 includes random access memory (RAM), which acts as external cache memory. By way of illustration and not limitation, RAM is available in many forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), and direct Rambus RAM (DRRAM).

Computer 512 also includes removable/non-removable, volatile/non-volatile computer storage media. FIG. 5 illustrates a disk storage 524, wherein such disk storage 524 includes, but is not limited to, devices like a magnetic disk drive, floppy disk drive, tape drive, Jaz drive, Zip drive, LS-60 drive, flash memory card, or memory stick. In addition, disk storage 524 can include storage media separately or in combination with other storage media including, but not limited to, an optical disk drive such as a compact disk ROM device (CD-ROM), CD recordable drive (CD-R Drive), CD rewritable drive (CD-RW Drive) or a digital versatile disk ROM drive (DVD-ROM). To facilitate connection of the disk storage devices 524 to the system bus 518, a removable or non-removable interface is typically used such as interface 526.

It is to be appreciated that FIG. 5 describes software that acts as an intermediary between users and the basic computer resources described in suitable operating environment 510. Such software includes an operating system 528. Operating system 528, which can be stored on disk storage 524, acts to control and allocate resources of the computer system 512. System applications 530 take advantage of the management of resources by operating system 528 through program modules 532 and program data 534 stored either in system memory 516 or on disk storage 524. It is to be appreciated that various components described herein can be implemented with various operating systems or combinations of operating systems.

A user enters commands or information into the computer 512 through input device(s) 536. Input devices 536 include, but are not limited to, a pointing device such as a mouse, trackball, stylus, touch pad, keyboard, microphone, joystick, game pad, satellite dish, scanner, TV tuner card, digital camera, digital video camera, web camera, and the like. These and other input devices connect to the processing unit 514 through the system bus 518 via interface port(s) 538. Interface port(s) 538 include, for example, a serial port, a parallel port, a game port, and a universal serial bus (USB). Output device(s) 540 use some of the same type of ports as input device(s) 536. Thus, for example, a USB port may be used to provide input to computer 512, and to output information from computer 512 to an output device 540. Output adapter 542 is provided to illustrate that there are some output devices 540 like monitors, speakers, and printers, among other output devices 540 that require special adapters. The output adapters 542 include, by way of illustration and not limitation, video and sound cards that provide a means of connection between the output device 540 and the system bus 518. It should be noted that other devices and/or systems of devices provide both input and output capabilities such as remote computer(s) 544.

Computer 512 can operate in a networked environment using logical connections to one or more remote computers, such as remote computer(s) 544. The remote computer(s) 544 can be a personal computer, a server, a router, a network PC, a workstation, a microprocessor based appliance, a peer device or other common network node and the like, and typically includes many or all of the elements described relative to computer 512. For purposes of brevity, only a memory storage device 546 is illustrated with remote computer(s) 544. Remote computer(s) 544 is logically connected to computer 512 through a network interface 548 and then physically connected via communication connection 550. Network interface 548 encompasses communication networks such as local-area networks (LAN) and wide-area networks (WAN). LAN technologies include Fiber Distributed Data Interface (FDDI), Copper Distributed Data Interface (CDDI), Ethernet/IEEE 802.3, Token Ring/IEEE 802.5 and the like. WAN technologies include, but are not limited to, point-to-point links, circuit switching networks like Integrated Services Digital Networks (ISDN) and variations thereon, packet switching networks, and Digital Subscriber Lines (DSL).

Communication connection(s) 550 refers to the hardware/software employed to connect the network interface 548 to the bus 518. While communication connection 550 is shown for illustrative clarity inside computer 512, it can also be external to computer 512. The hardware/software necessary for connection to the network interface 548 includes, for exemplary purposes only, internal and external technologies such as modems, including regular telephone grade modems, cable modems and DSL modems, ISDN adapters, and Ethernet cards.

FIG. 6 is a schematic block diagram of a sample-computing environment 600 that can be employed for implementing nodes as part of a federation, in accordance with an aspect of the subject innovation. The system 600 includes one or more client(s) 610. The client(s) 610 can be hardware and/or software (e.g., threads, processes, computing devices). The system 600 also includes one or more server(s) 630. The server(s) 630 can also be hardware and/or software (e.g., threads, processes, computing devices). The servers 630 can house threads to perform transformations by employing the components described herein, for example. One possible communication between a client 610 and a server 630 may be in the form of a data packet adapted to be transmitted between two or more computer processes. The system 600 includes a communication framework 650 that can be employed to facilitate communications between the client(s) 610 and the server(s) 630. The client(s) 610 are operatively connected to one or more client data store(s) 660 that can be employed to store information local to the client(s) 610. Similarly, the server(s) 630 are operatively connected to one or more server data store(s) 640 that can be employed to store information local to the servers 630.

Having thus described several aspects of at least one embodiment of this invention, it is to be appreciated that various alterations, modifications, and improvements will readily occur to those skilled in the art.

As an example of a possible variation, in an exemplary embodiment described above, a quorum of nodes was selected to be a majority of the nodes. Other implementations are possible, with the quorum being either greater or less than a majority of the nodes. Moreover, the quorum may change over time for a configuration as nodes fail or go off-line.

As an additional example, the present application describes a system in which loss of the primary master node is regarded as a catastrophic failure that triggers a rebuild of the configuration manager. It is not a requirement that the loss of a primary master node trigger a rebuild of the configuration manager. If one or more replicas of the current configuration information can be reliably identified, the configuration manager can be reset based on this information.
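
For illustration only, a rebuild of the configuration manager from reports generated by the data nodes might be sketched as follows in Python. Each report names the reporting node, a partition, the configuration version the node holds, and the set of data nodes for that version. The names `Report` and `rebuild_partition_map` are hypothetical, the default quorum of a strict majority follows the exemplary embodiment, and the tie-break that prefers the highest version when more than one reported configuration reaches a quorum is an assumption of this sketch rather than a stated requirement.

```python
from collections import defaultdict
from typing import Callable, Dict, FrozenSet, Iterable, NamedTuple, Tuple


class Report(NamedTuple):
    """Hypothetical message generated by one data node for one partition."""
    node: str
    partition: str
    version: int
    members: FrozenSet[str]


def rebuild_partition_map(
    reports: Iterable[Report],
    quorum: Callable[[int], int] = lambda n: n // 2 + 1,
) -> Dict[str, Tuple[int, FrozenSet[str]]]:
    """Return {partition: (version, members)} for configurations with a quorum."""
    # Collect the nodes that reported each (partition, version, members) triple.
    voters = defaultdict(set)
    for r in reports:
        # A node counts toward a configuration only if it is one of its members.
        if r.node in r.members:
            voters[(r.partition, r.version, r.members)].add(r.node)

    rebuilt: Dict[str, Tuple[int, FrozenSet[str]]] = {}
    for (partition, version, members), nodes in voters.items():
        if len(nodes) < quorum(len(members)):
            continue  # no quorum agrees on this configuration
        current = rebuilt.get(partition)
        # Assumed tie-break: keep the highest version that reached a quorum.
        if current is None or version > current[0]:
            rebuilt[partition] = (version, members)
    return rebuilt
```

A partition for which no reported configuration reaches a quorum is simply absent from the result, so a caller can distinguish partitions that were recovered from those that were not.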

Such alterations, modifications, and improvements are intended to be part of this disclosure, and are intended to be within the spirit and scope of the invention. Accordingly, the foregoing description and drawings are by way of example only.

The above-described embodiments of the present invention can be implemented in any of numerous ways. For example, the embodiments may be implemented using hardware, software or a combination thereof. When implemented in software, the software code can be executed on any suitable processor or collection of processors, whether provided in a single computer or distributed among multiple computers.

Further, it should be appreciated that a computer may be embodied in any of a number of forms, such as a rack-mounted computer, a desktop computer, a laptop computer, or a tablet computer. Additionally, a computer may be embedded in a device not generally regarded as a computer but with suitable processing capabilities, including a Personal Digital Assistant (PDA), a smart phone or any other suitable portable or fixed electronic device.

Also, a computer may have one or more input and output devices. These devices can be used, among other things, to present a user interface. Examples of output devices that can be used to provide a user interface include printers or display screens for visual presentation of output and speakers or other sound generating devices for audible presentation of output. Examples of input devices that can be used for a user interface include keyboards, and pointing devices, such as mice, touch pads, and digitizing tablets. As another example, a computer may receive input information through speech recognition or in other audible format.

Such computers may be interconnected by one or more networks in any suitable form, including as a local area network or a wide area network, such as an enterprise network or the Internet. Such networks may be based on any suitable technology and may operate according to any suitable protocol and may include wireless networks, wired networks or fiber optic networks.

Also, the various methods or processes outlined herein may be coded as software that is executable on one or more processors that employ any one of a variety of operating systems or platforms. Additionally, such software may be written using any of a number of suitable programming languages and/or programming or scripting tools, and also may be compiled as executable machine language code or intermediate code that is executed on a framework or virtual machine.

In this respect, the invention may be embodied as a computer readable medium (or multiple computer readable media) (e.g., a computer memory, one or more floppy discs, compact discs, optical discs, magnetic tapes, flash memories, circuit configurations in Field Programmable Gate Arrays or other semiconductor devices, or other tangible computer storage medium) encoded with one or more programs that, when executed on one or more computers or other processors, perform methods that implement the various embodiments of the invention discussed above. The computer readable medium or media can be transportable, such that the program or programs stored thereon can be loaded onto one or more different computers or other processors to implement various aspects of the present invention as discussed above.

The terms “program” or “software” are used herein in a generic sense to refer to any type of computer code or set of computer-executable instructions that can be employed to program a computer or other processor to implement various aspects of the present invention as discussed above. Additionally, it should be appreciated that according to one aspect of this embodiment, one or more computer programs that when executed perform methods of the present invention need not reside on a single computer or processor, but may be distributed in a modular fashion amongst a number of different computers or processors to implement various aspects of the present invention.

Computer-executable instructions may be in many forms, such as program modules, executed by one or more computers or other devices. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. Typically the functionality of the program modules may be combined or distributed as desired in various embodiments.

Also, data structures may be stored in computer-readable media in any suitable form. For simplicity of illustration, data structures may be shown to have fields that are related through location in the data structure. Such relationships may likewise be achieved by assigning storage for the fields with locations in a computer-readable medium that conveys relationship between the fields. However, any suitable mechanism may be used to establish a relationship between information in fields of a data structure, including through the use of pointers, tags or other mechanisms that establish relationship between data elements.

Various aspects of the present invention may be used alone, in combination, or in a variety of arrangements not specifically discussed in the embodiments described in the foregoing, and the invention is therefore not limited in its application to the details and arrangement of components set forth in the foregoing description or illustrated in the drawings. For example, aspects described in one embodiment may be combined in any manner with aspects described in other embodiments.

Also, the invention may be embodied as a method, of which an example has been provided. The acts performed as part of the method may be ordered in any suitable way. Accordingly, embodiments may be constructed in which acts are performed in an order different than illustrated, which may include performing some acts simultaneously, even though shown as sequential acts in illustrative embodiments.

Use of ordinal terms such as “first,” “second,” “third,” etc., in the claims to modify a claim element does not by itself connote any priority, precedence, or order of one claim element over another or the temporal order in which acts of a method are performed, but are used merely as labels to distinguish one claim element having a certain name from another element having a same name (but for use of the ordinal term) to distinguish the claim elements.

Also, the phraseology and terminology used herein is for the purpose of description and should not be regarded as limiting. The use of “including,” “comprising,” or “having,” “containing,” “involving,” and variations thereof herein, is meant to encompass the items listed thereafter and equivalents thereof as well as additional items.

1. A method of obtaining configuration information defining a current configuration of a plurality of data nodes storing replicas of a partition of a database, the method comprising: operating at least one processor to perform acts comprising: receiving a plurality of messages, each message generated by a data node of the plurality of data nodes and indicating a version of the configuration of the database for which the data node is configured and a set of data nodes configured in accordance with the indicated configuration to replicate the partition stored on the data node; identifying, based on the received messages, a selected set of data nodes, the selected set of data nodes being a set identified in at least one of the plurality of messages for which a quorum of the data nodes in the set each generated a message indicating the same configuration version and the selected set of data nodes; and storing as a portion of the configuration information an indication that each data node of the selected set is a data node storing a replica of the partition.
2. The method of claim 1, wherein the plurality of messages comprise messages from at least half of the data nodes configured to store the partition, and the data nodes forming the quorum comprise at least half of the data nodes storing the partition.
3. The method of claim 1, further comprising: sending a request to the plurality of data nodes storing the database for each to provide a respective message among the plurality of messages.
4. The method of claim 3, wherein the storing comprises: storing the configuration information in a configuration manager, the configuration manager comprising a plurality of master nodes in a master cluster.
5. The method of claim 4, further comprising: in response to detecting an event indicating a loss of integrity of the configuration information stored in the master cluster: deleting the configuration information from master nodes of the master cluster; and selecting a master node among the plurality of master nodes as a new primary master node.
6. The method of claim 1, wherein a second message among the plurality of messages generated by a second node indicates the second node has a second partition with a first configuration version for said second partition, and identifying data nodes for the second partition, the method further comprising: inspecting any messages among the plurality of messages from the data nodes for the second partition; and determining a quorum of data nodes for the second partition does not exist.
7. The method of claim 1, further comprising activating the partition in the configuration information.
8. The method of claim 7, wherein the identified quorum of data nodes for the partition comprises all of the data nodes for the partition.
9. A database system storing a database comprising a plurality of partitions, the system comprising: a plurality of computing nodes; and a network communicably interconnecting the plurality of computing nodes, wherein the plurality of computing nodes comprise: a plurality of data nodes organized as a plurality of sets, each set comprising nodes of the plurality of data nodes storing a replication of a partition of the plurality of partitions; and a plurality of master nodes, each master node storing a replication of configuration information, the configuration information identifying the data nodes in each of the plurality of sets and a partition of the plurality of partitions replicated on the nodes of each of the plurality of sets.
10. The system of claim 9, wherein the data nodes for a first partition of the plurality of partitions are each configured to generate a first message identifying the first partition as being replicated on said node, a configuration version of the first partition, and identifying each of the data nodes for the configuration version of the first partition, and the plurality of master nodes is configured to perform a method in response to a reconfiguration triggering event, the method comprising: receiving a plurality of the first messages generated by the data nodes for the first partition; identifying a quorum of data nodes for the first partition, the data nodes forming said quorum each having a same configuration version for said first partition; and updating the configuration information to indicate, for said first partition, the configuration version of the first partition and the data nodes for said first partition.
11. The system of claim 10, wherein the reconfiguration triggering event is a loss of quorum among the plurality of master nodes.
12. The system of claim 10, wherein the reconfiguration triggering event is a loss of a primary master node among the plurality of master nodes.
13. The system of claim 12, wherein: each of the plurality of master nodes is assigned a token on a communications ring; and a new primary master node among a plurality of master nodes remaining after the loss of the primary master node is identified as a master node having a token spanning a predetermined value.
14. The system of claim 13, wherein the new primary master node performs the method.
15. The system of claim 10, wherein identifying the quorum by the plurality of master nodes comprises: comparing the configuration version of the first partition identified by the first message from one of the data nodes to the configuration version indicated by the first messages from one or more other data nodes, the one or more other data nodes being the data nodes identified by the first message from the one of the data nodes as being the data nodes for the configuration version of the first partition.
16. A computer-readable storage medium comprising computer-executable instructions that, when executed by a computer system, perform a method, the method comprising: identifying the computer system as a primary node for a master partition; deleting any existing data for a global partition map; receiving a plurality of messages from at least a subset of a federation of nodes, each message generated by a node among the subset and indicating for said node a partition replicated on said node, a configuration version of the partition, and data nodes for the partition; identifying a quorum of data nodes for a first partition, the data nodes forming said quorum each having a same configuration version for said first partition; and updating the global partition map to indicate, for said first partition, the configuration version of the first partition and the data nodes for said first partition.
17. The computer-readable storage medium of claim 16, wherein the method further comprises: analyzing each of the plurality of messages to determine if the configuration version for the partition replicated by the respective node is part of a quorum for said partition.
18. The computer-readable storage medium of claim 16, wherein identifying the computer system as the primary node comprises determining the computer system has a token spanning a predetermined value.
19. The computer-readable storage medium of claim 16, wherein the method further comprises: sending a request to a plurality of nodes in the federation for each to provide a respective message among the plurality of messages.
20. The computer-readable storage medium of claim 16, wherein identifying the quorum of data nodes comprises: comparing the configuration version of the first partition identified by the first message from one of the data nodes to the configuration version indicated by the first messages from one or more other data nodes, the one or more other data nodes being the data nodes identified by the first message from the one of the data nodes as being the data nodes for the configuration version of the first partition; and determining that at least half of the data nodes for the configuration version of the first partition have the same configuration version for said first partition.